You are consulting for an airline company looking to enter the United States domestic market which has identified medium and large airports as their desired operating locations. The company believes that it has a competitive advantage in maintaining punctuality, so it plans on making this a big part of its brand image with a motto, “On time, for you.” To kick start operations, the company has decided to start with 5 round trip routes. An example of a round trip route is the combination of JFK to ORD and ORD to JFK. The opposite order of the route, ORD to JFK and JFK to ORD, would be considered the same round trip.
#importing libraries
import pandas as pd
import numpy as np
import plotly.express as px
Flights.csv dataset, columns 3,13 and 14 have mixed datatypes. OP_CARRIER_FL_NUM, should be a string so we will use the str datatype here. AIR_TIME in minutes, should be a datatype of int, so we will use the str datatype and investigate further in next steps.DISTANCE in miles, should be a datatype of int, so we will use the str datatype and investigate further in next steps.#read airport codes dataset
airport_codes = pd.read_csv('Airport_Codes.csv')
#read flights dataset
flights = pd.read_csv('Flights.csv', dtype = {'OP_CARRIER_FL_NUM':str,'AIR_TIME':str,'DISTANCE':str})
#read tickets dataset
tickets = pd.read_csv('Tickets.csv')
We will first check to see what the impact would be, to both the AIR_TIME and DISTANCE columns, of losing values that are not of the numeric datatype. If the impact is small, we can assume this will not have a significant impact on our overall analysis.
#finding percent of nulls in the air_time column
air_time_null_percentage = (flights['AIR_TIME'].isnull().sum() / len(flights['AIR_TIME'])) * 100
#printing the output
print(f"Percentage of null values in 'AIR_TIME' column is {air_time_null_percentage:.2f}%")
Percentage of null values in 'AIR_TIME' column is 2.95%
#creating a test air_time column before converting values to the numeric datatype
flights['AIR_TIME_TEST'] = flights['AIR_TIME']
#converting test air_time column to numeric datatype and coercing non-convertible values
flights['AIR_TIME_TEST'] = pd.to_numeric(flights['AIR_TIME_TEST'], errors='coerce')
#finding percent of nulls in the air_time test column post-conversion
air_time_test_null_percentage = (flights['AIR_TIME_TEST'].isnull().sum() / len(flights['AIR_TIME_TEST'])) * 100
#printing the output
print(f"Percentage of null values in 'AIR_TIME_TEST' column is {air_time_test_null_percentage:.2f}%")
Percentage of null values in 'AIR_TIME_TEST' column is 3.05%
Data Quality Insight: The percentage of null values in the AIR_TIME_TEST column increases by 0.1%. This is a small enough increase that it will not significantlty impact our analysis, so we can proceed with converting the values within our original AIR_TIME column and coercing non-convertible values.
#dropping air_time test column
flights = flights.drop(columns=['AIR_TIME_TEST'])
#converting air_time column to numeric datatype and coercing non-convertible values
flights['AIR_TIME'] = pd.to_numeric(flights['AIR_TIME'], errors='coerce')
We can now do the same thing for the DISTANCE column.
#finding percent of nulls in the distance column
distance_null_percentage = (flights['DISTANCE'].isnull().sum() / len(flights['DISTANCE'])) * 100
#printing the output
print(f"Percentage of null values in 'DISTANCE' column is {distance_null_percentage:.2f}%")
Percentage of null values in 'DISTANCE' column is 0.03%
#creating a test distance column before converting values to the numeric datatype
flights['DISTANCE_TEST'] = flights['DISTANCE']
#converting test distance column to numeric datatype and coercing non-convertible values
flights['DISTANCE_TEST'] = pd.to_numeric(flights['DISTANCE_TEST'], errors='coerce')
#finding percent of nulls in the distance test column post-conversion
distance_test_null_percentage = (flights['DISTANCE_TEST'].isnull().sum() / len(flights['DISTANCE_TEST'])) * 100
#printing the output
print(f"Percentage of null values in 'DISTANCE_TEST' column is {distance_test_null_percentage:.2f}%")
Percentage of null values in 'DISTANCE_TEST' column is 0.14%
Data Quality Insight: The percentage of null values in the DISTANCE_TEST column increases by 0.1%. This is a small enough increase that it will not significantlty impact our analysis, so we can proceed with converting the values within our original DISTANCE column and coercing non-convertible values.
#dropping distance test column
flights = flights.drop(columns=['DISTANCE_TEST'])
#converting distance column to numeric datatype and coercing non-convertible values
flights['DISTANCE'] = pd.to_numeric(flights['DISTANCE'], errors='coerce')
#printing airport_codes dataset info
airport_codes.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 55369 entries, 0 to 55368 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 TYPE 55369 non-null object 1 NAME 55369 non-null object 2 ELEVATION_FT 48354 non-null float64 3 CONTINENT 27526 non-null object 4 ISO_COUNTRY 55122 non-null object 5 MUNICIPALITY 49663 non-null object 6 IATA_CODE 9182 non-null object 7 COORDINATES 55369 non-null object dtypes: float64(1), object(7) memory usage: 3.4+ MB
#finding percent of null iata_codes in airport dataset
iata_code_null_percentage = (airport_codes['IATA_CODE'].isnull().sum() / len(airport_codes['IATA_CODE'])) * 100
#printing the output
print(f"Percentage of null values in 'IATA_CODE' column is {iata_code_null_percentage:.2f}%")
Percentage of null values in 'IATA_CODE' column is 83.42%
#filtering the DataFrame to get rows where 'IATA_CODE' is null
us_airports = airport_codes[(airport_codes['ISO_COUNTRY'] == 'US')]
#getting the value counts of 'TYPE' in 'missing_iata_code'
type_counts_us_aiports = us_airports['TYPE'].value_counts()
#calculating the percentage of each value relative to the total count
total_us_aiports = type_counts_us_aiports.sum()
percentages = (type_counts_us_aiports / total_us_aiports) * 100
#combining the value counts and percentages into one dataframe
result_df = pd.DataFrame({'Value Counts': type_counts_us_aiports, 'Percentage of Total':percentages.apply(lambda x: f'{x:.2f}%')})
result_df
| Value Counts | Percentage of Total | |
|---|---|---|
| TYPE | ||
| small_airport | 13708 | 60.10% |
| heliport | 6268 | 27.48% |
| closed | 1392 | 6.10% |
| medium_airport | 687 | 3.01% |
| seaplane_base | 566 | 2.48% |
| large_airport | 171 | 0.75% |
| balloonport | 18 | 0.08% |
#checking for duplicate IATA_CODES
duplicates = airport_codes.duplicated(subset=['IATA_CODE'], keep=False)
# Get the rows with duplicate IATA_CODES
duplicate_rows = airport_codes[duplicates & ~airport_codes['IATA_CODE'].isnull()]
# Counting the distinct 'TYPE' values and their occurrences for each duplicate 'IATA_CODE'
duplicate_counts = duplicate_rows.groupby(['IATA_CODE', 'TYPE']).size().reset_index()
duplicate_counts.columns = ['IATA_CODE', 'TYPE', 'Count']
duplicate_counts
| IATA_CODE | TYPE | Count | |
|---|---|---|---|
| 0 | 0 | small_airport | 80 |
| 1 | AHT | closed | 2 |
| 2 | ARX | closed | 1 |
| 3 | ARX | medium_airport | 1 |
| 4 | AUS | closed | 1 |
| 5 | AUS | large_airport | 1 |
| 6 | CDT | medium_airport | 2 |
| 7 | CLG | closed | 1 |
| 8 | CLG | small_airport | 1 |
| 9 | DDU | small_airport | 2 |
| 10 | DLR | small_airport | 2 |
| 11 | DZI | small_airport | 2 |
| 12 | ESP | closed | 1 |
| 13 | ESP | small_airport | 1 |
| 14 | GGC | small_airport | 2 |
| 15 | HKG | closed | 1 |
| 16 | HKG | large_airport | 1 |
| 17 | IZA | medium_airport | 2 |
| 18 | JNB | large_airport | 3 |
| 19 | KCZ | medium_airport | 2 |
| 20 | KMM | small_airport | 2 |
| 21 | KWB | small_airport | 2 |
| 22 | LHR | closed | 1 |
| 23 | LHR | large_airport | 1 |
| 24 | LMC | small_airport | 2 |
| 25 | LPE | small_airport | 2 |
| 26 | MNI | closed | 1 |
| 27 | MNI | medium_airport | 1 |
| 28 | MPT | small_airport | 2 |
| 29 | MRE | medium_airport | 1 |
| 30 | MRE | small_airport | 1 |
| 31 | MUC | closed | 1 |
| 32 | MUC | large_airport | 1 |
| 33 | MXR | small_airport | 2 |
| 34 | NWT | small_airport | 2 |
| 35 | PCO | small_airport | 2 |
| 36 | PHM | closed | 3 |
| 37 | PRM | heliport | 1 |
| 38 | PRM | medium_airport | 1 |
| 39 | RCH | medium_airport | 1 |
| 40 | RCH | small_airport | 1 |
| 41 | RMD | small_airport | 2 |
| 42 | RTI | small_airport | 2 |
| 43 | RZS | small_airport | 2 |
| 44 | SGL | heliport | 1 |
| 45 | SGL | small_airport | 1 |
| 46 | SHO | closed | 1 |
| 47 | SHO | medium_airport | 1 |
| 48 | SVD | closed | 1 |
| 49 | SVD | medium_airport | 1 |
| 50 | ULG | small_airport | 2 |
| 51 | VQS | closed | 1 |
| 52 | VQS | small_airport | 1 |
| 53 | YTY | medium_airport | 2 |
| 54 | ZRZ | small_airport | 2 |
Data Quality Insight: airports_codes contains duplicate IATA_CODES, so we will need to deduplicate thia dataset, while prioritizing rows where TYPE = medium_airport. Since we are only concerned with medium & large airports, we will create a new dataframe where TYPE is one of these two.
#filtering for rows where 'TYPE' is 'medium_airport' or 'large_airport'
filtered_airport_codes = airport_codes[airport_codes['TYPE'].isin(['medium_airport', 'large_airport'])]
filtered_airport_codes.head()
| TYPE | NAME | ELEVATION_FT | CONTINENT | ISO_COUNTRY | MUNICIPALITY | IATA_CODE | COORDINATES | |
|---|---|---|---|---|---|---|---|---|
| 6194 | medium_airport | Aleknagik / New Airport | 66.0 | NaN | US | Aleknagik | WKK | -158.617996216, 59.2826004028 |
| 10444 | medium_airport | Honiara International Airport | 28.0 | OC | SB | Honiara | HIR | 160.05499267578, -9.4280004501343 |
| 10449 | medium_airport | Munda Airport | 10.0 | OC | SB | NaN | MUA | 157.26300048828125, -8.327969551086426 |
| 10471 | medium_airport | Hongyuan Airport | 11600.0 | AS | CN | Aba | AHJ | 102.35224, 32.53154 |
| 10695 | medium_airport | Nauru International Airport | 22.0 | OC | NR | Yaren District | INU | 166.919006, -0.547458 |
#printing flights dataset info
flights.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1915886 entries, 0 to 1915885 Data columns (total 16 columns): # Column Dtype --- ------ ----- 0 FL_DATE object 1 OP_CARRIER object 2 TAIL_NUM object 3 OP_CARRIER_FL_NUM object 4 ORIGIN_AIRPORT_ID int64 5 ORIGIN object 6 ORIGIN_CITY_NAME object 7 DEST_AIRPORT_ID int64 8 DESTINATION object 9 DEST_CITY_NAME object 10 DEP_DELAY float64 11 ARR_DELAY float64 12 CANCELLED float64 13 AIR_TIME float64 14 DISTANCE float64 15 OCCUPANCY_RATE float64 dtypes: float64(6), int64(2), object(8) memory usage: 233.9+ MB
flights.head()
| FL_DATE | OP_CARRIER | TAIL_NUM | OP_CARRIER_FL_NUM | ORIGIN_AIRPORT_ID | ORIGIN | ORIGIN_CITY_NAME | DEST_AIRPORT_ID | DESTINATION | DEST_CITY_NAME | DEP_DELAY | ARR_DELAY | CANCELLED | AIR_TIME | DISTANCE | OCCUPANCY_RATE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-03-02 | WN | N955WN | 4591 | 14635 | RSW | Fort Myers, FL | 11042 | CLE | Cleveland, OH | -8.0 | -6.0 | 0.0 | 143.0 | 1025.0 | 0.97 |
| 1 | 2019-03-02 | WN | N8686A | 3231 | 14635 | RSW | Fort Myers, FL | 11066 | CMH | Columbus, OH | 1.0 | 5.0 | 0.0 | 135.0 | 930.0 | 0.55 |
| 2 | 2019-03-02 | WN | N201LV | 3383 | 14635 | RSW | Fort Myers, FL | 11066 | CMH | Columbus, OH | 0.0 | 4.0 | 0.0 | 132.0 | 930.0 | 0.91 |
| 3 | 2019-03-02 | WN | N413WN | 5498 | 14635 | RSW | Fort Myers, FL | 11066 | CMH | Columbus, OH | 11.0 | 14.0 | 0.0 | 136.0 | 930.0 | 0.67 |
| 4 | 2019-03-02 | WN | N7832A | 6933 | 14635 | RSW | Fort Myers, FL | 11259 | DAL | Dallas, TX | 0.0 | -17.0 | 0.0 | 151.0 | 1005.0 | 0.62 |
#printing tickets dataset info
tickets.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1167285 entries, 0 to 1167284 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ITIN_ID 1167285 non-null int64 1 YEAR 1167285 non-null int64 2 QUARTER 1167285 non-null int64 3 ORIGIN 1167285 non-null object 4 ORIGIN_COUNTRY 1167285 non-null object 5 ORIGIN_STATE_ABR 1167285 non-null object 6 ORIGIN_STATE_NM 1167285 non-null object 7 ROUNDTRIP 1167285 non-null float64 8 REPORTING_CARRIER 1167285 non-null object 9 PASSENGERS 1165308 non-null float64 10 ITIN_FARE 1166325 non-null object 11 DESTINATION 1167285 non-null object dtypes: float64(2), int64(3), object(7) memory usage: 106.9+ MB
Data Quality Insight: At the moment, the ITIN_FARE column is of the object datatype, but we will need this column to a numeric datatype in order to answer Ask #2. We will follow the steps we did above for both the DISTANCE and AIR_TIME columns in the flights dataset.
#finding percent of nulls in the itin_fare column
itin_fare_null_percentage = (tickets['ITIN_FARE'].isnull().sum() / len(tickets['ITIN_FARE'])) * 100
#printing the output
print(f"Percentage of null values in 'ITIN_FARE' column is {itin_fare_null_percentage:.2f}%")
Percentage of null values in 'ITIN_FARE' column is 0.08%
#creating a test distance column before converting values to the numeric datatype
tickets['ITIN_FARE_TEST'] = tickets['ITIN_FARE']
#converting test distance column to numeric datatype and coercing non-convertible values
tickets['ITIN_FARE_TEST'] = pd.to_numeric(tickets['ITIN_FARE_TEST'], errors='coerce')
#finding percent of nulls in the distance test column post-conversion
itin_fare_test_null_percentage = (tickets['ITIN_FARE_TEST'].isnull().sum() / len(tickets['ITIN_FARE_TEST'])) * 100
#printing the output
print(f"Percentage of null values in 'ITIN_FARE_TEST' column is {itin_fare_test_null_percentage:.2f}%")
Percentage of null values in 'ITIN_FARE_TEST' column is 0.35%
Data Quality Insight: The percentage of null values in the ITIN_FARE_TEST column increases by 0.3%. This is a small enough increase that it will not significantlty impact our analysis, so we can proceed with converting the values within our original ITIN_FARE column and coercing non-convertible values.
#dropping distance test column
tickets = tickets.drop(columns=['ITIN_FARE_TEST'])
#converting distance column to numeric datatype and coercing non-convertible values
tickets['ITIN_FARE'] = pd.to_numeric(tickets['ITIN_FARE'], errors='coerce')
#verifying the data type conversion was excuted correctly
tickets.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1167285 entries, 0 to 1167284 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ITIN_ID 1167285 non-null int64 1 YEAR 1167285 non-null int64 2 QUARTER 1167285 non-null int64 3 ORIGIN 1167285 non-null object 4 ORIGIN_COUNTRY 1167285 non-null object 5 ORIGIN_STATE_ABR 1167285 non-null object 6 ORIGIN_STATE_NM 1167285 non-null object 7 ROUNDTRIP 1167285 non-null float64 8 REPORTING_CARRIER 1167285 non-null object 9 PASSENGERS 1165308 non-null float64 10 ITIN_FARE 1163149 non-null float64 11 DESTINATION 1167285 non-null object dtypes: float64(3), int64(3), object(6) memory usage: 106.9+ MB
tickets.head()
| ITIN_ID | YEAR | QUARTER | ORIGIN | ORIGIN_COUNTRY | ORIGIN_STATE_ABR | ORIGIN_STATE_NM | ROUNDTRIP | REPORTING_CARRIER | PASSENGERS | ITIN_FARE | DESTINATION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201912723049 | 2019 | 1 | ABI | US | TX | Texas | 1.0 | MQ | 1.0 | 736.0 | DAB |
| 1 | 201912723085 | 2019 | 1 | ABI | US | TX | Texas | 1.0 | MQ | 1.0 | 570.0 | COS |
| 2 | 201912723491 | 2019 | 1 | ABI | US | TX | Texas | 1.0 | MQ | 1.0 | 564.0 | MCO |
| 3 | 201912723428 | 2019 | 1 | ABI | US | TX | Texas | 1.0 | MQ | 1.0 | 345.0 | LGA |
| 4 | 201912723509 | 2019 | 1 | ABI | US | TX | Texas | 0.0 | MQ | 1.0 | 309.0 | MGM |
Find the 10 busiest round trip routes in terms of number of round trip flights in the quarter. Exclude canceled flights when performing the calculation.
Data Quality Insight: There is not a clean way to join flights to tickets in order to determine which flights are roundtrip flights and which are not, but we can use the combination of ORIGIN and DESTINATION to create a ROUNDTRIP_ROUTE column. To do this we will concatenate both columns, using the sorted() function to ensure the same roundtrip route is assigned the two scenarios outlined above (and below).
"An example of a round trip route is the combination of JFK to ORD and ORD to JFK. The opposite order of the route, ORD to JFK and JFK to ORD, would be considered the same round trip."
#creating 'ROUNDTRIP_ROUTE' column in flights dataset
flights['ROUNDTRIP_ROUTE'] = flights.apply(lambda row: '-'.join(sorted([row['ORIGIN'], row['DESTINATION']])), axis=1)
#filtering for non-canceled flights
non_canceled_flights = flights[flights['CANCELLED'] == 0]
#getting counts for top 10 roundtrip routes
unique_routes_counts = non_canceled_flights['ROUNDTRIP_ROUTE'].value_counts().head(10).reset_index()
#renaming the columns in unique_routes_counts
unique_routes_counts.columns = ['ROUNDTRIP_ROUTE', 'COUNT']
#printing output
unique_routes_counts
| ROUNDTRIP_ROUTE | COUNT | |
|---|---|---|
| 0 | LAX-SFO | 8340 |
| 1 | LGA-ORD | 7156 |
| 2 | LAS-LAX | 6511 |
| 3 | JFK-LAX | 6320 |
| 4 | LAX-SEA | 4999 |
| 5 | BOS-LGA | 4820 |
| 6 | HNL-OGG | 4794 |
| 7 | PDX-SEA | 4774 |
| 8 | ATL-MCO | 4707 |
| 9 | ATL-LGA | 4594 |
#creating bar chart for 10 Busiest Round Trip Routes
fig = px.bar(unique_routes_counts, x='ROUNDTRIP_ROUTE', y='COUNT', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'COUNT': 'Trip Count'},
title='10 Busiest Round Trip Routes')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
The 10 most profitable round trip routes (without considering the upfront airplane cost) in the quarter. Along with the profit, show total revenue, total cost, summary values of other key components and total round trip flights in the quarter for the top 10 most profitable routes. Exclude canceled flights from these calculations.
Data Quality Insight: In order to calculate round trip route profits, we must:
tickets dataframe, as we previously did in the flights dataset, which we will call ROUNDTRIP_ROUTE.ROUNDTRIP_ROUTE.#creating 'ROUNDTRIP_ROUTE' column in tickets dataset
tickets['ROUNDTRIP_ROUTE'] = tickets.apply(lambda row: '-'.join(sorted([row['ORIGIN'], row['DESTINATION']])), axis=1)
#filtering tickets for round trips
round_trip_tickets = tickets[tickets['ROUNDTRIP'] == 1]
#calculating average rountrip route fare
avg_itin_fare = round_trip_tickets.groupby('ROUNDTRIP_ROUTE')['ITIN_FARE'].mean().reset_index()
#calcualting sum of miles and average occupancy rate by ROUNDTRIP_ROUTE usinf non_canceled_flights
round_trip_flight_info = non_canceled_flights.groupby('ROUNDTRIP_ROUTE').agg({
'ORIGIN': 'first',#adding the first 'ORIGIN' value associated with each route
'DESTINATION': 'first', #adding the first 'DESTINATION' value associated with each route
'ARR_DELAY': 'mean', # calculating the average arrival delay for each route
'DEP_DELAY': 'mean', # calculating the average departure delay for each route
'DISTANCE': 'first', # getting the first DISTANCE value associated with each route (each route has the ONE specified distance in the flights dataset so no need to grab the mean here)
'OCCUPANCY_RATE': 'mean', # calculating the average occupancy_rate for each route
}).reset_index()
#merging round_trip_flight_info and avg_itin_fare on'ROUNDTRIP_ROUTE' to create 'round_trips'
round_trips = pd.merge(round_trip_flight_info, avg_itin_fare, on='ROUNDTRIP_ROUTE')
#renaming column names
round_trips.rename(columns={
'ARR_DELAY': 'AVERAGE_ARR_DELAY',
'DEP_DELAY': 'AVERAGE_DEP_DELAY',
'OCCUPANCY_RATE': 'AVERAGE_OCCUPANCY_RATE',
'ITIN_FARE': 'AVERAGE_ITIN_FARE'
}, inplace=True)
#getting value_counts for each 'ROUNDTRIP_ROUTE' in 'non_canceled_flights'
route_value_counts = non_canceled_flights['ROUNDTRIP_ROUTE'].value_counts().reset_index()
route_value_counts.columns = ['ROUNDTRIP_ROUTE', 'TOTAL_FLIGHTS']
#merging the value_counts with 'round_trips'
round_trips = pd.merge(round_trips, route_value_counts, on='ROUNDTRIP_ROUTE', how='left')
round_trips
| ROUNDTRIP_ROUTE | ORIGIN | DESTINATION | AVERAGE_ARR_DELAY | AVERAGE_DEP_DELAY | DISTANCE | AVERAGE_OCCUPANCY_RATE | AVERAGE_ITIN_FARE | TOTAL_FLIGHTS | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ABE-ATL | ATL | ABE | 2.557604 | 6.873272 | 692.0 | 0.654332 | 557.141176 | 434 |
| 1 | ABE-CLT | ABE | CLT | 1.468127 | 3.838645 | 481.0 | 0.661096 | 486.394737 | 502 |
| 2 | ABE-DTW | ABE | DTW | 5.981855 | 13.663984 | 425.0 | 0.645392 | 385.809524 | 497 |
| 3 | ABE-FLL | ABE | FLL | 4.250000 | 5.400000 | 1041.0 | 0.605250 | 248.734375 | 40 |
| 4 | ABE-ORD | ABE | ORD | 23.666667 | 29.275000 | 654.0 | 0.649094 | 548.354839 | 320 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2930 | STL-TPA | STL | TPA | 5.073810 | 11.299287 | 869.0 | 0.658480 | 375.192982 | 421 |
| 2931 | STL-TUL | STL | TUL | 0.265432 | 6.193846 | 351.0 | 0.646308 | 392.350000 | 325 |
| 2932 | SYR-TPA | TPA | SYR | 9.500000 | 22.486111 | 1104.0 | 0.615556 | 308.512500 | 72 |
| 2933 | TPA-TTN | TTN | TPA | 7.557692 | 14.987261 | 955.0 | 0.639427 | 162.054054 | 157 |
| 2934 | VPS-XNA | VPS | XNA | 24.760000 | 20.360000 | 601.0 | 0.605600 | 270.900000 | 25 |
2935 rows × 9 columns
#calculating average_fare_bookings for each round trip route
round_trips['AVERAGE_FARE_BOOKINGS'] = (round_trips['AVERAGE_OCCUPANCY_RATE'] * 200) * round_trips['AVERAGE_ITIN_FARE']
#calculating average_baggage_fees
round_trips['AVERAGE_BAGGAGE_FEES'] = (round_trips['AVERAGE_OCCUPANCY_RATE'] * 200) * 70
#calculating average_airplane_costs for each round trip route
round_trips['AVERAGE_AIRPLANE_COST'] = 9.18 * round_trips['DISTANCE']
#creating function to calculate average arrival/delay costs based on criteria provided in prompt
def calculate_cost(delay_minutes):
if delay_minutes <= 15:
return 0
else:
return (delay_minutes - 15) * 75
#calcualting average arrival delay costs each round trip route
round_trips['AVERAGE_ARR_DELAY_COST'] = round_trips['AVERAGE_ARR_DELAY'].apply(calculate_cost)
#calcualting average arrival delay costs each round trip route
round_trips['AVERAGE_DEP_DELAY_COST'] = round_trips['AVERAGE_DEP_DELAY'].apply(calculate_cost)
round_trips
| ROUNDTRIP_ROUTE | ORIGIN | DESTINATION | AVERAGE_ARR_DELAY | AVERAGE_DEP_DELAY | DISTANCE | AVERAGE_OCCUPANCY_RATE | AVERAGE_ITIN_FARE | TOTAL_FLIGHTS | AVERAGE_FARE_BOOKINGS | AVERAGE_BAGGAGE_FEES | AVERAGE_AIRPLANE_COST | AVERAGE_ARR_DELAY_COST | AVERAGE_DEP_DELAY_COST | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABE-ATL | ATL | ABE | 2.557604 | 6.873272 | 692.0 | 0.654332 | 557.141176 | 434 | 72911.037463 | 9160.645161 | 6352.56 | 0.0 | 0.000000 |
| 1 | ABE-CLT | ABE | CLT | 1.468127 | 3.838645 | 481.0 | 0.661096 | 486.394737 | 502 | 64310.685783 | 9255.338645 | 4415.58 | 0.0 | 0.000000 |
| 2 | ABE-DTW | ABE | DTW | 5.981855 | 13.663984 | 425.0 | 0.645392 | 385.809524 | 497 | 49799.703363 | 9035.492958 | 3901.50 | 0.0 | 0.000000 |
| 3 | ABE-FLL | ABE | FLL | 4.250000 | 5.400000 | 1041.0 | 0.605250 | 248.734375 | 40 | 30109.296094 | 8473.500000 | 9556.38 | 0.0 | 0.000000 |
| 4 | ABE-ORD | ABE | ORD | 23.666667 | 29.275000 | 654.0 | 0.649094 | 548.354839 | 320 | 71186.739718 | 9087.312500 | 6003.72 | 650.0 | 1070.625000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2930 | STL-TPA | STL | TPA | 5.073810 | 11.299287 | 869.0 | 0.658480 | 375.192982 | 421 | 49411.400758 | 9218.717340 | 7977.42 | 0.0 | 0.000000 |
| 2931 | STL-TUL | STL | TUL | 0.265432 | 6.193846 | 351.0 | 0.646308 | 392.350000 | 325 | 50715.764615 | 9048.307692 | 3222.18 | 0.0 | 0.000000 |
| 2932 | SYR-TPA | TPA | SYR | 9.500000 | 22.486111 | 1104.0 | 0.615556 | 308.512500 | 72 | 37981.316667 | 8617.777778 | 10134.72 | 0.0 | 561.458333 |
| 2933 | TPA-TTN | TTN | TPA | 7.557692 | 14.987261 | 955.0 | 0.639427 | 162.054054 | 157 | 20724.339473 | 8951.974522 | 8766.90 | 0.0 | 0.000000 |
| 2934 | VPS-XNA | VPS | XNA | 24.760000 | 20.360000 | 601.0 | 0.605600 | 270.900000 | 25 | 32811.408000 | 8478.400000 | 5517.18 | 732.0 | 402.000000 |
2935 rows × 14 columns
#merging 'ORIGIN_SIZE' based on 'ORIGIN' and 'IATA_CODE'
origin_size = filtered_airport_codes[['IATA_CODE', 'TYPE']].rename(columns={'TYPE': 'ORIGIN_SIZE'})
round_trips = pd.merge(round_trips, origin_size, left_on='ORIGIN', right_on='IATA_CODE', how='left')
round_trips.drop('IATA_CODE', axis=1, inplace=True)
#merging 'DESTINATION_SIZE' based on 'DESTINATION' and 'IATA_CODE'
destination_size = filtered_airport_codes[['IATA_CODE', 'TYPE']].rename(columns={'TYPE': 'DESTINATION_SIZE'})
round_trips = pd.merge(round_trips, destination_size, left_on='DESTINATION', right_on='IATA_CODE', how='left')
round_trips.drop('IATA_CODE', axis=1, inplace=True)
round_trips
| ROUNDTRIP_ROUTE | ORIGIN | DESTINATION | AVERAGE_ARR_DELAY | AVERAGE_DEP_DELAY | DISTANCE | AVERAGE_OCCUPANCY_RATE | AVERAGE_ITIN_FARE | TOTAL_FLIGHTS | AVERAGE_FARE_BOOKINGS | AVERAGE_BAGGAGE_FEES | AVERAGE_AIRPLANE_COST | AVERAGE_ARR_DELAY_COST | AVERAGE_DEP_DELAY_COST | ORIGIN_SIZE | DESTINATION_SIZE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABE-ATL | ATL | ABE | 2.557604 | 6.873272 | 692.0 | 0.654332 | 557.141176 | 434 | 72911.037463 | 9160.645161 | 6352.56 | 0.0 | 0.000000 | large_airport | medium_airport |
| 1 | ABE-CLT | ABE | CLT | 1.468127 | 3.838645 | 481.0 | 0.661096 | 486.394737 | 502 | 64310.685783 | 9255.338645 | 4415.58 | 0.0 | 0.000000 | medium_airport | large_airport |
| 2 | ABE-DTW | ABE | DTW | 5.981855 | 13.663984 | 425.0 | 0.645392 | 385.809524 | 497 | 49799.703363 | 9035.492958 | 3901.50 | 0.0 | 0.000000 | medium_airport | large_airport |
| 3 | ABE-FLL | ABE | FLL | 4.250000 | 5.400000 | 1041.0 | 0.605250 | 248.734375 | 40 | 30109.296094 | 8473.500000 | 9556.38 | 0.0 | 0.000000 | medium_airport | large_airport |
| 4 | ABE-ORD | ABE | ORD | 23.666667 | 29.275000 | 654.0 | 0.649094 | 548.354839 | 320 | 71186.739718 | 9087.312500 | 6003.72 | 650.0 | 1070.625000 | medium_airport | large_airport |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2930 | STL-TPA | STL | TPA | 5.073810 | 11.299287 | 869.0 | 0.658480 | 375.192982 | 421 | 49411.400758 | 9218.717340 | 7977.42 | 0.0 | 0.000000 | large_airport | large_airport |
| 2931 | STL-TUL | STL | TUL | 0.265432 | 6.193846 | 351.0 | 0.646308 | 392.350000 | 325 | 50715.764615 | 9048.307692 | 3222.18 | 0.0 | 0.000000 | large_airport | large_airport |
| 2932 | SYR-TPA | TPA | SYR | 9.500000 | 22.486111 | 1104.0 | 0.615556 | 308.512500 | 72 | 37981.316667 | 8617.777778 | 10134.72 | 0.0 | 561.458333 | large_airport | large_airport |
| 2933 | TPA-TTN | TTN | TPA | 7.557692 | 14.987261 | 955.0 | 0.639427 | 162.054054 | 157 | 20724.339473 | 8951.974522 | 8766.90 | 0.0 | 0.000000 | medium_airport | large_airport |
| 2934 | VPS-XNA | VPS | XNA | 24.760000 | 20.360000 | 601.0 | 0.605600 | 270.900000 | 25 | 32811.408000 | 8478.400000 | 5517.18 | 732.0 | 402.000000 | large_airport | medium_airport |
2935 rows × 16 columns
#getting unique value counts in ORIGIN_SIZE
round_trips['ORIGIN_SIZE'].value_counts()
ORIGIN_SIZE large_airport 2413 medium_airport 482 Name: count, dtype: int64
#finding percent of nulls in the origin_size column
origin_size_percentage = (round_trips['ORIGIN_SIZE'].isnull().sum() / len(round_trips['ORIGIN_SIZE'])) * 100
#printing the output
print(f"Percentage of null values in 'ORIGIN_SIZE' column is {origin_size_percentage:.2f}%")
Percentage of null values in 'ORIGIN_SIZE' column is 1.36%
#getting unique value counts in DESTINATION_SIZE
round_trips['DESTINATION_SIZE'].value_counts()
DESTINATION_SIZE large_airport 2402 medium_airport 480 Name: count, dtype: int64
#finding percent of nulls in the origin_size column
destination_size_percentage = (round_trips['DESTINATION_SIZE'].isnull().sum() / len(round_trips['DESTINATION_SIZE'])) * 100
#printing the output
print(f"Percentage of null values in 'DESTINATION_SIZE' column is {destination_size_percentage:.2f}%")
Percentage of null values in 'DESTINATION_SIZE' column is 1.81%
Data Quality Insight: 1% of the rows in the ORIGIN_SIZE columm are NULL and about 2% of the rows in the DESTINATION column are NULL. Since about 82% of the tpyes in both columns are large airports we can replace these NULL values in both with large_airport.
#replacing null values in 'ORIGIN_SIZE' and 'DESTINATION_SIZE' with 'large_airport'
round_trips['ORIGIN_SIZE'].fillna('large_airport', inplace=True)
round_trips['DESTINATION_SIZE'].fillna('large_airport', inplace=True)
#creating function to calculate airport cost based on airport size
def calculate_airport_cost(size):
if size == 'medium_airport':
return 5000
elif size == 'large_airport':
return 10000
else:
return None
#creating 'ORIGIN_AIRPORT_COST' column based on 'ORIGIN_SIZE'
round_trips['ORIGIN_AIRPORT_COST'] = round_trips['ORIGIN_SIZE'].apply(calculate_airport_cost)
#creating 'DESTINATION_AIRPORT_COST' column based on 'DESTINATION_SIZE'
round_trips['DESTINATION_AIRPORT_COST'] = round_trips['DESTINATION_SIZE'].apply(calculate_airport_cost)
round_trips
| ROUNDTRIP_ROUTE | ORIGIN | DESTINATION | AVERAGE_ARR_DELAY | AVERAGE_DEP_DELAY | DISTANCE | AVERAGE_OCCUPANCY_RATE | AVERAGE_ITIN_FARE | TOTAL_FLIGHTS | AVERAGE_FARE_BOOKINGS | AVERAGE_BAGGAGE_FEES | AVERAGE_AIRPLANE_COST | AVERAGE_ARR_DELAY_COST | AVERAGE_DEP_DELAY_COST | ORIGIN_SIZE | DESTINATION_SIZE | ORIGIN_AIRPORT_COST | DESTINATION_AIRPORT_COST | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABE-ATL | ATL | ABE | 2.557604 | 6.873272 | 692.0 | 0.654332 | 557.141176 | 434 | 72911.037463 | 9160.645161 | 6352.56 | 0.0 | 0.000000 | large_airport | medium_airport | 10000 | 5000 |
| 1 | ABE-CLT | ABE | CLT | 1.468127 | 3.838645 | 481.0 | 0.661096 | 486.394737 | 502 | 64310.685783 | 9255.338645 | 4415.58 | 0.0 | 0.000000 | medium_airport | large_airport | 5000 | 10000 |
| 2 | ABE-DTW | ABE | DTW | 5.981855 | 13.663984 | 425.0 | 0.645392 | 385.809524 | 497 | 49799.703363 | 9035.492958 | 3901.50 | 0.0 | 0.000000 | medium_airport | large_airport | 5000 | 10000 |
| 3 | ABE-FLL | ABE | FLL | 4.250000 | 5.400000 | 1041.0 | 0.605250 | 248.734375 | 40 | 30109.296094 | 8473.500000 | 9556.38 | 0.0 | 0.000000 | medium_airport | large_airport | 5000 | 10000 |
| 4 | ABE-ORD | ABE | ORD | 23.666667 | 29.275000 | 654.0 | 0.649094 | 548.354839 | 320 | 71186.739718 | 9087.312500 | 6003.72 | 650.0 | 1070.625000 | medium_airport | large_airport | 5000 | 10000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2930 | STL-TPA | STL | TPA | 5.073810 | 11.299287 | 869.0 | 0.658480 | 375.192982 | 421 | 49411.400758 | 9218.717340 | 7977.42 | 0.0 | 0.000000 | large_airport | large_airport | 10000 | 10000 |
| 2931 | STL-TUL | STL | TUL | 0.265432 | 6.193846 | 351.0 | 0.646308 | 392.350000 | 325 | 50715.764615 | 9048.307692 | 3222.18 | 0.0 | 0.000000 | large_airport | large_airport | 10000 | 10000 |
| 2932 | SYR-TPA | TPA | SYR | 9.500000 | 22.486111 | 1104.0 | 0.615556 | 308.512500 | 72 | 37981.316667 | 8617.777778 | 10134.72 | 0.0 | 561.458333 | large_airport | large_airport | 10000 | 10000 |
| 2933 | TPA-TTN | TTN | TPA | 7.557692 | 14.987261 | 955.0 | 0.639427 | 162.054054 | 157 | 20724.339473 | 8951.974522 | 8766.90 | 0.0 | 0.000000 | medium_airport | large_airport | 5000 | 10000 |
| 2934 | VPS-XNA | VPS | XNA | 24.760000 | 20.360000 | 601.0 | 0.605600 | 270.900000 | 25 | 32811.408000 | 8478.400000 | 5517.18 | 732.0 | 402.000000 | large_airport | medium_airport | 10000 | 5000 |
2935 rows × 18 columns
#Calculating average/total revenue for each roundtrip route
round_trips['AVERAGE_TOTAL_REVENUE'] = (round_trips['AVERAGE_FARE_BOOKINGS'] + round_trips['AVERAGE_BAGGAGE_FEES'])
round_trips['TOTAL_REVENUE'] = (round_trips['AVERAGE_FARE_BOOKINGS'] + round_trips['AVERAGE_BAGGAGE_FEES'])* round_trips['TOTAL_FLIGHTS']
#calculating average/total expenses for each roundtrip route
round_trips['AVERAGE_TOTAL_EXPENSE'] = (round_trips['AVERAGE_AIRPLANE_COST'] +
round_trips['AVERAGE_ARR_DELAY_COST'] +
round_trips['AVERAGE_DEP_DELAY_COST'] +
round_trips['ORIGIN_AIRPORT_COST'] +
round_trips['DESTINATION_AIRPORT_COST'])
round_trips['TOTAL_EXPENSE'] = (
(round_trips['AVERAGE_AIRPLANE_COST'] +
round_trips['AVERAGE_ARR_DELAY_COST'] +
round_trips['AVERAGE_DEP_DELAY_COST'] +
round_trips['ORIGIN_AIRPORT_COST'] +
round_trips['DESTINATION_AIRPORT_COST'])* round_trips['TOTAL_FLIGHTS'])
#calculating average/total profit for eaxch rountrip route
round_trips['AVERAGE_PROFIT'] = round_trips['AVERAGE_TOTAL_REVENUE'] - round_trips['AVERAGE_TOTAL_EXPENSE']
round_trips['TOTAL_PROFIT'] = round_trips['TOTAL_REVENUE'] - round_trips['TOTAL_EXPENSE']
round_trips
| ROUNDTRIP_ROUTE | ORIGIN | DESTINATION | AVERAGE_ARR_DELAY | AVERAGE_DEP_DELAY | DISTANCE | AVERAGE_OCCUPANCY_RATE | AVERAGE_ITIN_FARE | TOTAL_FLIGHTS | AVERAGE_FARE_BOOKINGS | ... | ORIGIN_SIZE | DESTINATION_SIZE | ORIGIN_AIRPORT_COST | DESTINATION_AIRPORT_COST | AVERAGE_TOTAL_REVENUE | TOTAL_REVENUE | AVERAGE_TOTAL_EXPENSE | TOTAL_EXPENSE | AVERAGE_PROFIT | TOTAL_PROFIT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ABE-ATL | ATL | ABE | 2.557604 | 6.873272 | 692.0 | 0.654332 | 557.141176 | 434 | 72911.037463 | ... | large_airport | medium_airport | 10000 | 5000 | 82071.682624 | 3.561911e+07 | 21352.560000 | 9267011.04 | 60719.122624 | 2.635210e+07 |
| 1 | ABE-CLT | ABE | CLT | 1.468127 | 3.838645 | 481.0 | 0.661096 | 486.394737 | 502 | 64310.685783 | ... | medium_airport | large_airport | 5000 | 10000 | 73566.024429 | 3.693014e+07 | 19415.580000 | 9746621.16 | 54150.444429 | 2.718352e+07 |
| 2 | ABE-DTW | ABE | DTW | 5.981855 | 13.663984 | 425.0 | 0.645392 | 385.809524 | 497 | 49799.703363 | ... | medium_airport | large_airport | 5000 | 10000 | 58835.196321 | 2.924109e+07 | 18901.500000 | 9394045.50 | 39933.696321 | 1.984705e+07 |
| 3 | ABE-FLL | ABE | FLL | 4.250000 | 5.400000 | 1041.0 | 0.605250 | 248.734375 | 40 | 30109.296094 | ... | medium_airport | large_airport | 5000 | 10000 | 38582.796094 | 1.543312e+06 | 24556.380000 | 982255.20 | 14026.416094 | 5.610566e+05 |
| 4 | ABE-ORD | ABE | ORD | 23.666667 | 29.275000 | 654.0 | 0.649094 | 548.354839 | 320 | 71186.739718 | ... | medium_airport | large_airport | 5000 | 10000 | 80274.052218 | 2.568770e+07 | 22724.345000 | 7271790.40 | 57549.707218 | 1.841591e+07 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2930 | STL-TPA | STL | TPA | 5.073810 | 11.299287 | 869.0 | 0.658480 | 375.192982 | 421 | 49411.400758 | ... | large_airport | large_airport | 10000 | 10000 | 58630.118098 | 2.468328e+07 | 27977.420000 | 11778493.82 | 30652.698098 | 1.290479e+07 |
| 2931 | STL-TUL | STL | TUL | 0.265432 | 6.193846 | 351.0 | 0.646308 | 392.350000 | 325 | 50715.764615 | ... | large_airport | large_airport | 10000 | 10000 | 59764.072308 | 1.942332e+07 | 23222.180000 | 7547208.50 | 36541.892308 | 1.187612e+07 |
| 2932 | SYR-TPA | TPA | SYR | 9.500000 | 22.486111 | 1104.0 | 0.615556 | 308.512500 | 72 | 37981.316667 | ... | large_airport | large_airport | 10000 | 10000 | 46599.094444 | 3.355135e+06 | 30696.178333 | 2210124.84 | 15902.916111 | 1.145010e+06 |
| 2933 | TPA-TTN | TTN | TPA | 7.557692 | 14.987261 | 955.0 | 0.639427 | 162.054054 | 157 | 20724.339473 | ... | medium_airport | large_airport | 5000 | 10000 | 29676.313996 | 4.659181e+06 | 23766.900000 | 3731403.30 | 5909.413996 | 9.277780e+05 |
| 2934 | VPS-XNA | VPS | XNA | 24.760000 | 20.360000 | 601.0 | 0.605600 | 270.900000 | 25 | 32811.408000 | ... | large_airport | medium_airport | 10000 | 5000 | 41289.808000 | 1.032245e+06 | 21651.180000 | 541279.50 | 19638.628000 | 4.909657e+05 |
2935 rows × 24 columns
#finding the 10 most profitable routes based on 'AVERAGE_TOTAL_PROFIT'
top_10_routes = round_trips.nlargest(10, 'TOTAL_PROFIT')
top_10_routes[['ROUNDTRIP_ROUTE', 'TOTAL_PROFIT']]
| ROUNDTRIP_ROUTE | TOTAL_PROFIT | |
|---|---|---|
| 2076 | JFK-LAX | 5.740261e+08 |
| 2103 | JFK-SFO | 2.832267e+08 |
| 1619 | EWR-SFO | 2.553984e+08 |
| 2273 | LGA-ORD | 2.048425e+08 |
| 2230 | LAX-SFO | 2.006740e+08 |
| 189 | ATL-LGA | 1.923855e+08 |
| 1153 | DCA-ORD | 1.877187e+08 |
| 186 | ATL-LAX | 1.788756e+08 |
| 564 | BOS-LGA | 1.643118e+08 |
| 1137 | DCA-LGA | 1.632471e+08 |
#creating bar chart for 10 Most Profitable Round Trip Routes by Total Profit
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='TOTAL_PROFIT', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'TOTAL_PROFIT': 'Profit'},
title='10 Most Profitable Round Trip Routes by Total Profit')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
#creating bar chart for 10 Most Profitable Round Trip Routes by Average Profit
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='AVERAGE_PROFIT', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'AVERAGE_PROFIT': 'Profit'},
title='10 Most Profitable Round Trip Routes by Average Profit')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
#creating bar chart for 10 Most Profitable Round Trip Routes by Total Revenue
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='TOTAL_REVENUE', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'TOTAL_REVENUE': 'Revenue'},
title='10 Most Profitable Round Trip Routes by Total Revenue')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
#creating bar chart for 10 Most Profitable Round Trip Routes by Total Expense
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='TOTAL_EXPENSE', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'TOTAL_EXPENSE': 'Expense'},
title='10 Most Profitable Round Trip Routes by Total Expense')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
#creating bar chart for 10 Most Profitable Round Trip Routes by Total Flights
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='TOTAL_FLIGHTS', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'TOTAL_FLIGHTS': 'Flights'},
title='10 Most Profitable Round Trip Routes by Total Flights')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
#creating bar chart for 10 Most Profitable Round Trip Routes by Average Total Fare Bookings
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='AVERAGE_FARE_BOOKINGS', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'AVERAGE_FARE_BOOKINGS': 'Flights'},
title='10 Most Profitable Round Trip Routes by Average Total Fare Bookings')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
#creating bar chart for 10 Most Profitable Round Trip Routes by Average Airplane Costs
fig = px.bar(top_10_routes, x='ROUNDTRIP_ROUTE', y='AVERAGE_AIRPLANE_COST', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'AVERAGE_AIRPLANE_COST': 'Airplane Costs'},
title='10 Most Profitable Round Trip Routes by Average Airplane Costs')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
Top 5 Recommended Round Trip Routes To Invest In
#creating a dataframe with only the specified roundtrip routes
selected_routes = ['JFK-LAX', 'JFK-SFO', 'EWR-SFO', 'ATL-LAX', 'DCA-ORD']
selected_top_routes = top_10_routes[top_10_routes['ROUNDTRIP_ROUTE'].isin(selected_routes)].copy()
#calculating the number of flights needed to break even on the upfront airplane cost
selected_top_routes['FLIGHTS_TO_BREAK_EVEN'] = 90000000 / selected_top_routes['AVERAGE_PROFIT']
#rounding breakeven values to whole numbers
selected_top_routes['FLIGHTS_TO_BREAK_EVEN'] = selected_top_routes['FLIGHTS_TO_BREAK_EVEN'].round(0)
selected_top_routes[['ROUNDTRIP_ROUTE', 'FLIGHTS_TO_BREAK_EVEN']]
| ROUNDTRIP_ROUTE | FLIGHTS_TO_BREAK_EVEN | |
|---|---|---|
| 2076 | JFK-LAX | 991.0 |
| 2103 | JFK-SFO | 1182.0 |
| 1619 | EWR-SFO | 854.0 |
| 1153 | DCA-ORD | 1772.0 |
| 186 | ATL-LAX | 1609.0 |
Number of Flights to Breakeven for Top 5 Recommended Flights:
#creating bar chart for Flights to Breakeven for Top 5 Recommended Flights
fig = px.bar(selected_top_routes, x='ROUNDTRIP_ROUTE', y='FLIGHTS_TO_BREAK_EVEN', text_auto='.2s',
labels={'ROUNDTRIP_ROUTE': 'Round Trip Route', 'FLIGHTS_TO_BREAK_EVEN': 'Flights'},
title='Flights to Breakeven for Top 5 Recommended Flights')
fig.update_layout(
xaxis=dict(tickfont=dict(size=12), title_standoff=10),
yaxis=dict(tickfont=dict(size=12), title_standoff=10),
title=dict(x=0.5, y=0.85, xanchor='center', yanchor='top'))
fig.show()
Key Performance Indicators (KPI’s) To Track (From Dataset):
Additional Key Performance Indicators (aKPI’s) To Track (Not Inlcuded In Dataset):
#creating metadata as a dictionary
metadata_dict = {
'ROUNDTRIP_ROUTE': 'The roundtrip flight route between the origin and destination airports.',
'AVERAGE_ARR_DELAY': 'The average arrival delay time in minutes for a rountrip route.',
'AVERAGE_DEP_DELAY': 'The average departure delay time in minutes for a rountrip route.',
'AVERAGE_OCCUPANCY_RATE': 'The average percentage of seats filled for a rountrip route.',
'AVERAGE_ITIN_FARE': 'The average rountrip itinerary fare for a rountrip route.',
'TOTAL_FLIGHTS': 'The total number of flights for a rountrip route.',
'AVERAGE_FARE_BOOKINGS': 'The average total itinerary fare revenue. Caculated from (Average Occupancy Rate * 200) multiplied by Average Itinerary Fare.',
'AVERAGE_BAGGAGE_FEES': 'The average baggage fee revenue collected per rountrip route. This includes origin and destination in calculation.',
'AVERAGE_AIRPLANE_COST': 'The average operational cost per flight for the airplane servicing a rountrip route. This incudes the cost of fuel, oil, maintenance, crew, depreciation, insurance, etc. (per mile).',
'AVERAGE_ARR_DELAY_COST': 'The average arrival delay cost incurred for a rountrip route.',
'AVERAGE_DEP_DELAY_COST': 'The average departure delay cost incurred for a rountrip route.',
'ORIGIN_SIZE': 'The size of the origin airport - either medium or large.',
'DESTINATION_SIZE': 'The size of the destination airport - medium or large.',
'ORIGIN_AIRPORT_COST': 'The fixed operating cost for the origin airport.',
'DESTINATION_AIRPORT_COST': 'The fixed operating cost for the destination airport.',
'AVERAGE_TOTAL_REVENUE': 'The average total revenue generated per roundtrip route.',
'TOTAL_REVENUE': 'The total revenue generated by the total flights recorded for a rountrip route.',
'AVERAGE_TOTAL_EXPENSE': 'The average total cost per flight on this route.',
'TOTAL_EXPENSE': 'The total cost of the total flights recorded for a rountrip route.',
'AVERAGE_PROFIT': 'The average profit per flight on this route.',
'TOTAL_PROFIT': 'The total profit from the total flights recorded for a rountrip route.',
'FLIGHTS_TO_BREAK_EVEN': 'The estimated number of flights needed for this route to break-even with respects to the upfront cost of the airplane ($90 million).'
}
#creating metadata dataframe from metadata dictionary
metadata = pd.DataFrame(list(metadata_dict.items()), columns=['New Column', 'Description'])
# Setting the maximum column width to 'None' to make it easier to view metadata Descriptions
pd.options.display.max_colwidth = None
metadata
| New Column | Description | |
|---|---|---|
| 0 | ROUNDTRIP_ROUTE | The roundtrip flight route between the origin and destination airports. |
| 1 | AVERAGE_ARR_DELAY | The average arrival delay time in minutes for a rountrip route. |
| 2 | AVERAGE_DEP_DELAY | The average departure delay time in minutes for a rountrip route. |
| 3 | AVERAGE_OCCUPANCY_RATE | The average percentage of seats filled for a rountrip route. |
| 4 | AVERAGE_ITIN_FARE | The average rountrip itinerary fare for a rountrip route. |
| 5 | TOTAL_FLIGHTS | The total number of flights for a rountrip route. |
| 6 | AVERAGE_FARE_BOOKINGS | The average total itinerary fare revenue. Caculated from (Average Occupancy Rate * 200) multiplied by Average Itinerary Fare. |
| 7 | AVERAGE_BAGGAGE_FEES | The average baggage fee revenue collected per rountrip route. This includes origin and destination in calculation. |
| 8 | AVERAGE_AIRPLANE_COST | The average operational cost per flight for the airplane servicing a rountrip route. This incudes the cost of fuel, oil, maintenance, crew, depreciation, insurance, etc. (per mile). |
| 9 | AVERAGE_ARR_DELAY_COST | The average arrival delay cost incurred for a rountrip route. |
| 10 | AVERAGE_DEP_DELAY_COST | The average departure delay cost incurred for a rountrip route. |
| 11 | ORIGIN_SIZE | The size of the origin airport - either medium or large. |
| 12 | DESTINATION_SIZE | The size of the destination airport - medium or large. |
| 13 | ORIGIN_AIRPORT_COST | The fixed operating cost for the origin airport. |
| 14 | DESTINATION_AIRPORT_COST | The fixed operating cost for the destination airport. |
| 15 | AVERAGE_TOTAL_REVENUE | The average total revenue generated per roundtrip route. |
| 16 | TOTAL_REVENUE | The total revenue generated by the total flights recorded for a rountrip route. |
| 17 | AVERAGE_TOTAL_EXPENSE | The average total cost per flight on this route. |
| 18 | TOTAL_EXPENSE | The total cost of the total flights recorded for a rountrip route. |
| 19 | AVERAGE_PROFIT | The average profit per flight on this route. |
| 20 | TOTAL_PROFIT | The total profit from the total flights recorded for a rountrip route. |
| 21 | FLIGHTS_TO_BREAK_EVEN | The estimated number of flights needed for this route to break-even with respects to the upfront cost of the airplane ($90 million). |